Academic Open Internet
Journal |
Volume 12, 2004
|
A
Simple Algorithm for Minimal Search Data Index: Active Partitioned
Data Search
Using
1 Research Scholar, Dept. of Computer Science & Engineering
2 HOD, Dept. of Computer Science & engineering
Email:rnalliah@Yahoo.co.in
Abstract:
In
this paper, we examine the issue of mining association rules among items
of a
large database of search engines. The mining of association rules can be
mapped
into the problem of discovering large item sets where a large item set
is a
group of items which appear in a sufficient number of transactions. The
problem
of discovering large itemsets can be solved by constructing an index set
of
itemsets first and then, identifying, within the subdivided set.
Generally this
is done iteratively for each large Lk itemsets are equally
partitioned into Sk subsets.
To
determine the related data of a searching text each index itemset
contains the
two candidate keys. One is moral text and another one is key that
indicates the
subset name where the data is originally located. To address this issue,
we
propose the parallel algorithm for “Minimal Search Data Index(MSDI)”
based on
the Active Partition Data Search (APDS). Extensive simulation study is
conducted to evaluate performance of the proposed algorithm
1.1 Parallel Virtual Machine
The
system called Parallel Virtual Machine
(PVM) allows a heterogeneous collection of workstations and computers to
construct a parallel processing environment. PVM is portable and runs on
a wide
variety of modern systems. The PVM system permits a heterogeneous
collection of
UNIX computers networked together to be viewed by a user program as a
single
parallel environment with some distinctive features:
A PVM computational model is shown in Figure
1.1. The dotted
lines indicate integer component communication and synchronization, and
the
solid lines indicate the inter-instance communication and
synchronization.
Input and Partitioning
Fig 1.1: PVM Computational Model
PVM
tasks may
possess arbitrary control and dependency structures. In other words, at
any
point in the execution of a concurrent application, any task in
existence may
start or stop other tasks or add or delete computers from the virtual
machine.
Any process may communicate and/or synchronize with any other. Any
specific
control and dependency structure may be implemented under the PVM system
by
appropriate use of PVM constructs and host language control-flow
statements.
Owing
to its
ubiquitous nature (specifically, the virtual machine concept) and also
because
of its simple but complete programming interface, the PVM system has
gained
widespread acceptance in the high –performance scientific computing
community.
/*database elements added into the PVM by means of n
X n
matrix format */
foreach transactions rowi
Î
SIZE do begin /*SIZE may be
network capacity */
foreach transactions colj
Î SIZE
do begin
a[rowi][col j ]
= SIZE/20
/* divide by
means of packet size */
pvm_spawn(“testdrive”,(char**)0,pvmTestDefault,”“,
NPROCS,task_id);
/* Call the test drive for
data distribution */
/*
NPROCS can be a SIZE of the subset */
rowi
++;
colj
++;
end
end
foreach roxi
Î
NPROCS do begin
/* Connection Establishment */
pvm_initsend
(PvmDataDefault);
/* Data type Implementation */
pvm_pkint
(&num_data,1,1);
/* Packing a specified data set */
pvm_pkint
(&a[num_data*rowi] [num_data*colj],num_data,1);
/* Ready for Data Transmission */
pvm_send (tast_id[rowi],1);
end
foreach rowi
Î
NPROCS do begin
/*
Recovery of specified task */
pvm-recv (task_id[I],2);
/* unpack an item for reading */
pvm_upkint
(&result [rowi],1,1);
/* Area that is used to substitute a
parallel algorithm */
end
1.2 System study
When we deeply think about recent trends of
information technology,
Database and its mining concept plays a very important role. From the
bottom
level “shop” to top level research areas, there is a database
everywhere.
But how to manage these huge
databases is a question. For that “Distributive Rules” of data mining
give the
initial solution to use a candidate key instead of searching an entire
database.
Based on this idea, we proposed the
datamining “distributive rules” based algorithm called Minimal Search
Data
Index (MSDI). In that algorithm Index itemset plays a very important
role.
Various algorithms have been
proposed to discover the large itemsets. Generally, these algorithms
first
construct a candidate set of large itemsets based on some heuristics,
and then
discover the subset that indeed contains large itemsets.
In this paper, we propose an algorithm MSDI for time reduced data search. It is very efficient in the filed of very large database management system. Here APDS algorithm is the backbone connectivity for this MSDI algorithm.
Searching For: Applications of Data Mining
Bypassed Link Ref Key I Data Mining, DSIII Ref Key II Ref Key III Ref Key IV (25 lakhs Data) Eg. Data Mining (25 lakhs Data) (25 lakhs Data) (25 lakhs Data) (More
than 1Crore Data) Search
Data Set III
DATA INDEX
Data Set IV
Data Set II
Data Set I
Huge Database
Fig 1.2 MSDI Architecture
/*Ln= Very
Large
Database Variable*/
/* Set all the database index point to BOF */
foreach data
datasub Î SIZE do begin
forall transaction
ip Î
n do begin
check
for data occurrences
if (ip eq
Li
) then do begin
Raise a flag variable with
different signals
end
end
data
++;
end
/*Li = Node that is connected with flag
variable
*/
if (flv=GREEN) then do begin
while (|
Li[
Lq=Li++;
process (L); /* Overall
Index Value
*/
forall transactions LqÎ flv do begin
if (Lq != ‘\0’) then do begin
s[Ldats
]=getc(
Ldats ++
end
end
Lqf = File
name that is
red from the index page
fp=fopen(Lqf,”r”)
While((g=getc(fp))!=EOF)
Display the
Result Page
Case I :
while (| Li[
If exists, raise the
flag as green,
else raise the red flag.
Open the
green flaged datasets and read it until the null value exists
if (Lq != ‘\0’)
Open the
file with read mode, display the contents
simultaneously in paging format.
Case 2:
Pascal Mode Generation
-------------------------------- (Data Mining)
Data set I
Moral Page
Moral
Dataset Security
DS2 Data Mining
DS3 CSMA/CD
DS4
Index Page
QAS
DS4
CSMA/CD-----------QAS
------------------------------------------- Data Mining
---------------------------------------------------------
Data set IV
Data set III
Fig 1.3 .Busy Establishment of Candidate and
Index key that
indicate suitable subset in a large database
Comparision
Experiments
Values in Milli Seconds.
Subset |
MSDI |
APDS |
Dataset
Size |
S1
(Index
Page Generation) |
0.25
|
------ |
1,27,500
items |
S2 (Subset
2) |
----- |
0.26
|
-----Do-- |
S3 (Subset
3) |
------ |
0.252 |
-----
Do ---- |
S4 (Subset
4) |
0.13 |
0.252 |
|
Total
|
0.38 |
0.779 |
|
Fig 1.4 Comparison Chart
Above table shows the relative
performance between MSDI and APDS. Here, we use | T|=15, i.e., each
transaction
has 15 double items on an average, so as to have more large itemsets in
later
passes for interest of presentation. The execution time of these two
algorithms
is shown in figure. In the phase 4 of both the algorithms are the same.
Phase 4
partitioned the database from very large database. But the other phases
are
entirely different when compared to MSDI with APDS. Every result may
result in
different microseconds. The tabulation with the two different database
sizes.
But when we go to the execution time, the time can be reduced in a large
data
set when compared to small data set. As mentioned before, in the MSDI
algorithm
has the options of switching from APDS to MSDI after early passes for
better
performance, and such an option is not adopted here. It can,
nevertheless, be
seen from figure that the execution time of the APDS is larger than the
total
execution time by MSDI.It can be seen from table that the execution time
of the
first pass of MSDI is additional step compared with APDS. However, MSDI
incurs
significantly smaller execution times than APDS in later passes, not
only in
the second pass through out the table.
The increased database sized graph
shows that the execution time of MSDI increases linearly as the database
size
increases, meaning that MSDI possesses the same important feature as
APDS.
Also, we examine the performance of MSDI as the number of items, N,
increases.
Table shows the execution times of MSDI when the number of item
increases from
1, 00,000 to 1, 50,000 for these data sets 1.14 msec and 0.94 msec
respectively.
In other words, a large transaction has a larger likelihood of having
large
itemsets to process than a small transaction. Also, given a fixed
minimum
support, when the number of items N increases, the execution time to
obtain L,
increases since the size of L1 is usually close to N1 but the execution
time to
obtain larger k-itemsets decreases since the support for an itemset is
averaged
out by more items and then decreases. Consequently, as shown in table,
when N
increases, execution time for small transactions increase a little more
prominently than those for large transactions. Performance of MSDI when
the
number of items increases
N
|
Index
|
Reference
|
1,90,000 |
0.12Msec |
0.75
Msec |
3,00,000 |
0.09Msec |
0.32
Msec |
4,50,000 |
0.06Msec |
0.14
Msec |
We examined in this paper, the issue
of mining distributive rules among various
items in
a very large database of search engine transactions. The problem of
handling
large itemsets was solved by constructing an indexed set and then
subdivided
the large database into number of small scale database. We proposed
the algorithm
MSDI is especially effective for the identification of specific item
in a very
large database. We compare this algorithm with APDS algorithm and it
concluded
that the MSDI algorithm is better than APDS in case of data items
increases.
Extensive simulation study has been concluded to evaluate performance
of the
proposed algorithm and that simulation guides “MSDI is the algorithm
that battles
with delay”.
1) R. Agrawal, T. Imielinski,
and A.
Swami, “Mining Association Rules Between Sets of Items in Large
Databases,” Proc.
1993 ACM SIGMOD Int’l Conf. Management of Data, pp. 207-216, Wahington,
D.C.,
May 1993
2)
R.
Agrawal and J.C. Shafer, “Parallel Mining of Association Rules: Design,
Implementation, and Experience,” IEEE Trans. Knowledge and Data
3)
R.
Agrawal and R.Srikant, “Fast Algorithms for Mining Association Rules,” Proc.
1994
Int’lconf. Very Large Data Bases, pp. 487-499,
4)
R.
Agrawal and R. Srikant, “Mining Sequential Patterns,” Proc. 1995
Int’l Conf.
Data Eng., pp.265-276,Tucson,Ariz., May 1997.
5)
S.
Brain, R. Motwani, and C.Silverstein, “Beyond Market Basket:
Generalizing
Association Rules to Correlations,” Proc. 1997 ACM SIGNMOD Int’l Conf.
Management of Data, pp. 265-276, Tucson, Ariz., May 1997
6)
S.
Chaudhuri and U. Dayal, “An Overview of Data Warehousing and PLAP
Technology,”AGM SIGMOD Record, vol. 26, pp. 65-74, 1997.
7)
M.S.
Chen, J. Han, and P.S. Yu, “Data Mining: An overview from a Database
Perspective,” IEEETrans. Knowledge and Data Engg, Vol.8, PP.866-883,
1996.
8)
D.W.
Cheung, J. Han, V. Ng,A. Fu, and Y. Fu, “A Fast Distributed Algorithm
for
Mining Association Rules,” Proc. 1996 Int’l Conf. Parallel and
Distributed
Information Systems, PP. 1996 Int’l Conf. Data Enf., PP. 106-114, New
Orleans,
Feb. 1996.
9)
Y.
Fu and J. Han, V. Ng, A. Fu, and Y. Fu, “A Fast Distributed Algorithm
for
Mining Association Rules,” Proc. 1996 Int’l Conf. Parallel and
Distributed
Information Systems, PP. 31-44, Miami Beach, Fla.,Dec. 1996.
10) D.W.
Cheung,
J. Han, V. Ng, and C.Y. Wong, “Maintenance of Discovered Association
Rules in Large Databases: An Incremental Updating Technique,” Proc.
1996
Int’l Conf, Data Engg. PP. 106-114,
Technical College - Bourgas,